Davide Bombacigno

Interested in solutions, and people

Analysis of videogames market, biggest market and best selling genres

We will analyze the sales data of the videogames market using the data about games with more than 100000 copies sold

source: https://www.kaggle.com/gregorut/videogamesales Code and data used: https://github.com/davbom97/source

I assume this data to be somewhat representative sample of the videogames market and I will use it to extract some insights

Import Data
Data Cleaning
Data Analysis
- Composition of Total Sales by Market
- Time series of Sales by Market
- Markets composition by Genre
- Analysis of Genre Sales
  - Genre sales boxplot
  - Which Genre sells the most per game?

import pandas as pd
import numpy as np
from IPython.display import display,Image
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.express as px

Import Data

data = pd.read_csv("vgsales.csv").sort_values("Global_Sales",ascending= False)
data = data.drop("Rank",axis = 1)

display(data.head(5))

	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
0	Wii Sports	Wii	2006.0	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74
1	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
2	Mario Kart Wii	Wii	2008.0	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82
3	Wii Sports Resort	Wii	2009.0	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00
4	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37

The data reports for each game:

The game name (Unique for every game)
The main platform it was published (Categorical value)
Year it was bublished (Numeric discrete value)
Its genre (Categorical value)
Its publisher (Categorical value)
Sales by market (Numeric continuous value)

Entries are ranked by descending order of global sales

Data Cleaning

Because the data is incomplete from 2016 on, we drop all the entries beyond 2015

data = data[data["Year"] <= 2015]

Then we check if there are any NaN values in the data

data_na = data.isna().sum()
display(data_na)

Name             0
Platform         0
Year             0
Genre            0
Publisher       34
NA_Sales         0
EU_Sales         0
JP_Sales         0
Other_Sales      0
Global_Sales     0
dtype: int64

There seems to be some missing values in the “Publisher” category, may be due some minor publishers

Becouse there are only 34 missing values, I decided to delete those entries

data = data.dropna()

Data Analysis

Composition of Total Sales by Market

We group the sales by year and market to find which one contributes the most to the total sales

market_data = data[["Year","NA_Sales","EU_Sales","JP_Sales","Other_Sales"]].groupby("Year").sum()
display(market_data.head(5))

	NA_Sales	EU_Sales	JP_Sales	Other_Sales
Year
1980.0	10.59	0.67	0.00	0.12
1981.0	33.40	1.96	0.00	0.32
1982.0	26.92	1.65	0.00	0.31
1983.0	7.76	0.80	8.10	0.14
1984.0	33.28	2.10	14.27	0.70

There seems to be missing data of JP_Sales from 1980 to 1982, so we drop these years for the analysis

market_data = market_data[market_data.index >1982]

display(market_data.sum().reset_index())
fig = px.pie(market_data.sum().reset_index(),values = 0,names = "index")
fig.update_layout(title="Global Sales composition by Market")
img_bytes = fig.to_image(format="png")
Image(img_bytes)

	index	0
0	NA_Sales	4233.81
1	EU_Sales	2375.65
2	JP_Sales	1270.55
3	Other_Sales	780.39

From the data emerges that the biggest market is the North American one, making up to almost the 50% of the total sales, followed by the European and the Japanese ones.

Time series of Sales per Market

Now we want to see the time series of the total sales and the sales per market

fig = go.Figure()

fig.add_trace(go.Scatter(x = market_data.index, y= market_data.sum(axis = 1), mode = "lines+markers"))

fig.update_layout(title="Time Series of Total Sales",
                   xaxis_title="Total Sales (Mln)",
                   yaxis_title="Year")
img_bytes = fig.to_image(format="png")
Image(img_bytes)

png

fig = go.Figure()
fig.add_trace(go.Scatter(x = market_data.index, y= market_data["NA_Sales"], name = "NA Sales", mode = "lines+markers"))
fig.add_trace(go.Scatter(x = market_data.index, y= market_data["EU_Sales"], name = "EU Sales", mode = "lines+markers"))
fig.add_trace(go.Scatter(x = market_data.index, y= market_data["JP_Sales"], name = "JP Sales", mode = "lines+markers"))
fig.add_trace(go.Scatter(x = market_data.index, y= market_data["Other_Sales"], name = "Other Markets Sales", mode = "lines+markers"))
fig.add_trace(go.Scatter(x = market_data.index, y= market_data.sum(axis = 1), name = "Total Sales" ,mode = "lines+markers"))
fig.update_layout(title="Time Series of Market Sales",
                   xaxis_title="Total Sales (Mln)",
                   yaxis_title="Year")
img_bytes = fig.to_image(format="png")
Image(img_bytes)

png

market_sales_change = (market_data.pct_change()*100).round(2)
total_sales_change = (market_data.sum(axis = 1).pct_change()*100).round(2)

fig = make_subplots(rows=5, cols=1, shared_yaxes = True)
fig.add_trace(go.Scatter(x = market_sales_change.index, y= market_sales_change["NA_Sales"], name = "NA Sales", mode = "lines+markers"),row =1,col = 1)
fig.add_trace(go.Scatter(x = market_sales_change.index, y= market_sales_change["EU_Sales"], name = "EU Sales", mode = "lines+markers"),row =2,col = 1)
fig.add_trace(go.Scatter(x = market_sales_change.index, y= market_sales_change["JP_Sales"], name = "JP Sales", mode = "lines+markers"),row =3,col = 1)
fig.add_trace(go.Scatter(x = market_sales_change.index, y= market_sales_change["Other_Sales"], name = "Other Markets Sales", mode = "lines+markers"),row =4,col = 1)
fig.add_trace(go.Scatter(x = total_sales_change.index, y= total_sales_change, name = "Total Sales" ,mode = "lines+markers"),row =5,col = 1)
fig.update_layout(title="Rate of change of Market Sales",
                   xaxis_title="Year",
                   yaxis_title="Rate of change (%)",
                   height=800)
img_bytes = fig.to_image(format="png")
Image(img_bytes)

png

The data shows the growth of the videogames markets and the effects of the economic cycles and bubbles (the dot-com bubble in 2000, the crisis of 2008 on the NA sales and the European debt crisis of 2010-2011 on the EU markets).

In general, the market peaked in 2008, right before the economi crisis of those years, while showing a down trend in 2015.

While it is obvious the impact of the economic cycle on the sales, it must be noted that some changes in the industry might have shifted the focus of the companies and the customers(Microtransactions, DLCs, Free to Play business models, etc), indicating that the reduction of sales does not mean that the industry is in a bad shape, but just that now there are more revenue sources for the companies besides game sales than in the past.

Looking at the average growth rates of each market:

display(market_sales_change.mean().round(2))

NA_Sales       28.72
EU_Sales       38.28
JP_Sales        7.88
Other_Sales    46.86
dtype: float64

And the total sales growth rate:

display("The total sales average growth rate is : %.2f" % total_sales_change.mean())

'The total sales average growth rate is : 19.10'

Markets composition by Genre

Now we can see which are the best selling genres and platforms for each market using a heatmap:

market_sales_genre = data.drop(["Global_Sales","Year"], axis = 1).groupby("Genre").sum()
market_sales_genre = (market_sales_genre.div(market_sales_genre.sum(axis = 0),axis = 1)*100).round(2)
fig = ff.create_annotated_heatmap(z = market_sales_genre.values,
                                  y = market_sales_genre.index.values.tolist(),
                                  x = market_sales_genre.columns.values.tolist())
fig.update_layout(title="Composition of each market by genre (%)")
img_bytes = fig.to_image(format="png")
Image(img_bytes)

png

print("The top 3 selling genres in the NA market are:")
print(market_sales_genre["NA_Sales"].sort_values(ascending = False)[:3])

print("\nThe top 3 selling genres in the EU market are:")
print(market_sales_genre["EU_Sales"].sort_values(ascending = False)[:3])

print("\nThe top 3 selling genres in the JP market are:")
print(market_sales_genre["JP_Sales"].sort_values(ascending = False)[:3])

print("\nThe top 3 selling genres in other markets are:")
print(market_sales_genre["Other_Sales"].sort_values(ascending = False)[:3])

The top 3 selling genres in the NA market are:
Genre
Action     19.88
Sports     15.46
Shooter    13.19
Name: NA_Sales, dtype: float64

The top 3 selling genres in the EU market are:
Genre
Action     21.43
Sports     15.29
Shooter    12.72
Name: EU_Sales, dtype: float64

The top 3 selling genres in the JP market are:
Genre
Role-Playing    27.28
Action          12.03
Sports          10.55
Name: JP_Sales, dtype: float64

The top 3 selling genres in other markets are:
Genre
Action     23.44
Sports     16.74
Shooter    12.74
Name: Other_Sales, dtype: float64

Composition of Total Sales by Genre

Like we did for the markets, we group the sales by Genre and Year:

genre_sales = data[["Year","Genre","Global_Sales"]].groupby(["Year","Genre"]).sum().unstack(level=-1).droplevel(0,axis = 1)
display(genre_sales.head(12))

Genre	Action	Adventure	Fighting	Misc	Platform	Puzzle	Racing	Role-Playing	Shooter	Simulation	Sports	Strategy
Year
1980.0	0.34	NaN	0.77	2.71	NaN	NaN	NaN	NaN	7.07	NaN	0.49	NaN
1981.0	14.84	NaN	NaN	NaN	6.93	2.24	0.48	NaN	10.04	0.45	0.79	NaN
1982.0	6.52	NaN	NaN	0.87	5.03	10.03	1.57	NaN	3.79	NaN	1.05	NaN
1983.0	2.86	0.40	NaN	2.14	6.93	0.78	NaN	NaN	0.48	NaN	3.20	NaN
1984.0	1.85	NaN	NaN	1.45	0.69	3.14	5.95	NaN	31.10	NaN	6.18	NaN
1985.0	3.52	NaN	1.05	NaN	43.17	3.21	NaN	NaN	1.00	0.03	1.96	NaN
1986.0	13.74	NaN	NaN	NaN	9.39	NaN	1.96	2.52	3.89	NaN	5.57	NaN
1987.0	1.12	4.38	5.42	NaN	1.74	NaN	NaN	4.65	0.71	NaN	3.72	NaN
1988.0	1.75	NaN	NaN	NaN	27.73	5.58	2.14	5.88	0.51	0.03	3.60	NaN
1989.0	4.64	NaN	NaN	1.28	20.66	37.75	NaN	2.20	1.20	NaN	5.72	NaN
1990.0	6.39	NaN	NaN	NaN	22.97	6.00	6.26	4.52	NaN	1.14	2.11	NaN
1991.0	6.76	2.24	0.39	0.08	7.64	3.24	1.14	3.25	2.00	2.14	2.41	0.94

There are missing values (probably due to the genre not being defined at the time, or not existing), so we delete all the entries before 1991.

genre_sales = genre_sales[genre_sales.index >= 1991]

fig = px.pie(genre_sales.sum(),values = 0,names = genre_sales.columns)
fig.update_layout(title="Global Sales composition by Genre")
img_bytes = fig.to_image(format="png")
Image(img_bytes)

png

Here we can see that the best selling Genres are:

Action
Sports
Shooter and Role-Playing

Analysis of Genre Sales

We want to ask ourselves “Which Genres are, on average, the best selling?”

Looking at the total sales does not tell us the full story, because the most of sales are done only by few very big hits and are dependent on the count of games relased:

data_1991_on = data[data["Year"]>=1991]
genre_sales_desc = data_1991_on[["Genre","Global_Sales"]].groupby("Genre").describe().droplevel(0,axis = 1)
genre_sales_desc = genre_sales_desc.sort_values(by =["50%"],axis = 0,ascending = False)
display(genre_sales_desc)

	count	mean	std	min	25%	50%	75%	max
Genre
Platform	829.0	0.822461	2.005954	0.01	0.090	0.250	0.6900	30.01
Shooter	1220.0	0.777205	1.692879	0.01	0.080	0.220	0.7100	14.76
Sports	2240.0	0.562612	2.127120	0.01	0.090	0.220	0.5525	82.74
Fighting	818.0	0.529279	0.960339	0.01	0.080	0.210	0.5500	13.04
Action	3063.0	0.537173	1.185337	0.01	0.070	0.190	0.5000	21.40
Racing	1195.0	0.591431	1.689209	0.01	0.075	0.190	0.5300	35.82
Role-Playing	1417.0	0.633211	1.741657	0.01	0.070	0.190	0.5300	31.37
Misc	1660.0	0.470030	1.340625	0.01	0.060	0.160	0.4100	29.02
Simulation	834.0	0.464808	1.216313	0.01	0.060	0.160	0.4300	24.76
Puzzle	549.0	0.315993	0.839252	0.01	0.040	0.100	0.2700	15.30
Strategy	660.0	0.261773	0.527860	0.01	0.040	0.095	0.2800	5.45
Adventure	1239.0	0.184036	0.503521	0.01	0.020	0.060	0.1600	11.18

Looking at the histogram of the distribution of global sales for all games:

fig = px.histogram(data_1991_on["Global_Sales"], x="Global_Sales", nbins=50)
fig.update_layout(title="Histogram of the distribution of Global Sales",
                   xaxis_title="Global Sales")
img_bytes = fig.to_image(format="png")
Image(img_bytes)

png

Calculating the 95th percentile:

print("The 95th percentile is equal to : %f" % (data_1991_on["Global_Sales"].quantile(0.95)))

The 95th percentile is equal to : 2.008500

Meaning that the 95% of games in this sample made less than 2 milion copies sold

relative_total = 0
i = 0
total = data_1991_on["Global_Sales"].sum()
sales = data_1991_on[["Global_Sales"]]
pareto = []
while relative_total <= 0.8:
    game_sales_perc = data_1991_on["Global_Sales"].iloc[i]/total
    relative_total += game_sales_perc
    i+=1
print("The %f%% of games makes up for the %f%% of sales" % (i/len(sales)*100, relative_total*100))

The 25.572373% of games makes up for the 80.003392% of sales

Another evidence of the skewness is that just around 25% percent of games make the 80% total sales

Genre Sales boxplot

Within every Genre there is a huge variance, in support of our hypothesis we can see that the median sales are lower than the average sales, it is even more evident with a boxplot:

fig = px.box(data_1991_on, y="Global_Sales",x = "Genre")
img_bytes = fig.to_image(format="png")
Image(img_bytes)

png

The outliers (best selling games) make the boxplot unreadable, zooming in the boxes:

Image(filename="boxplot_zoomed.png") 

png

Which Genre sells the most per game?

To answer this question we have to take in account the variability within each genre, thus we will evaluate the best selling genre per game using the quartile coefficient of dispersion:

genre_sales_qcd = ((genre_sales_desc["75%"]-genre_sales_desc["25%"])/(genre_sales_desc["75%"]+genre_sales_desc["25%"])).sort_values(ascending = False)
print("The quartile coefficient of dispersion for each Genre is:")
display(genre_sales_qcd)

fig = go.Figure(go.Bar(x = genre_sales_qcd.index, y = genre_sales_qcd.values, text = genre_sales_qcd.round(2),textposition='outside'))
fig.update_layout(title="Quartile coefficient of dispersion",
                   yaxis_title="qcd",
                   height=500)
img_bytes = fig.to_image(format="png")
Image(img_bytes)

The quartile coefficient of dispersion for each Genre is:



Genre
Shooter         0.797468
Adventure       0.777778
Platform        0.769231
Role-Playing    0.766667
Simulation      0.755102
Action          0.754386
Racing          0.752066
Strategy        0.750000
Fighting        0.746032
Misc            0.744681
Puzzle          0.741935
Sports          0.719844
dtype: float64

png

Note: When comparing two genres, the one with the less coefficient of variation has lower variability between each game

Now, as written above, on average the best selling genres are:

Platform
Shooter
Sports

for genre in genre_sales_desc.index[:3]:
    print("The Genre %s qcv: %f median : %f" %(genre, genre_sales_qcd.loc[genre],genre_sales_desc.loc[genre]["50%"]))

The Genre Platform qcv: 0.769231 median : 0.250000
The Genre Shooter qcv: 0.797468 median : 0.220000
The Genre Sports qcv: 0.719844 median : 0.220000

Conclusions

Among the videogames markets, the NA is the biggest one, followed by EU and JP

The most sold Genres are Action, Sports, Shooter, and make up a significant portion of videogames sales in each market, with the exeption of Role-Playing, which is sold a lot in the JP market, making up the 27.3% of sales alone in that market.

Looking at the median copies sold by each game within a genre, We have that the Platform, Shooter and Sports games are on average the best selling, the median of the Platform genre in bigger than the Shooter one, while having a smaller quartile coefficient of dispersion, meaning that it has a lower variability of copies sold from game to game, meaning that overall the Platform Genre is better selling than the Shooting Genre.

The Shooter Genre has a equal median and a bigger qcd than the Sports, meaning that they sell on average the same, but the Shooter has more variability in sales, thus concluding that the top three selling Genres, on average, are in order:

Platform
Sports
Shooter

2020 3
2019 1

2019

Single facility location problem

6 minute read

In this post we will find out how to choose the best location for a single uncapacitated facility given a set of points of interest